MAWR.SNP.filtering

0.1 load relevant R packages

library(vcfR) #v1.14.0

   *****       ***   vcfR   ***       *****
   This is vcfR 1.14.0 
     browseVignettes('vcfR') # Documentation
     citation('vcfR') # Citation
   *****       *****      *****       *****
library(ggplot2) #v3.4.1
library(adegenet) #v2.1.10
Loading required package: ade4

   /// adegenet 2.1.10 is loaded ////////////

   > overview: '?adegenet'
   > tutorials/doc/questions: 'adegenetWeb()' 
   > bug reports/feature requests: adegenetIssues()
library(SNPfiltR) #v1.0.1
This is SNPfiltR v.1.0.2

Detailed usage information is available at: devonderaad.github.io/SNPfiltR/ 

If you use SNPfiltR in your published work, please cite the following papers: 

DeRaad, D.A. (2022), SNPfiltR: an R package for interactive and reproducible SNP filtering. Molecular Ecology Resources, 22, 2443-2453. http://doi.org/10.1111/1755-0998.13618 

Knaus, Brian J., and Niklaus J. Grunwald. 2017. VCFR: a package to manipulate and visualize variant call format data in R. Molecular Ecology Resources, 17.1:44-53. http://doi.org/10.1111/1755-0998.12549
library(StAMPP) #v1.6.3
Loading required package: pegas
Loading required package: ape
Registered S3 method overwritten by 'pegas':
  method      from
  print.amova ade4

Attaching package: 'pegas'
The following object is masked from 'package:ape':

    mst
The following object is masked from 'package:ade4':

    amova
The following objects are masked from 'package:vcfR':

    getINFO, write.vcf

We will follow the SNP filtering protocol outlined by the SNPfiltR R package in [@deraad2022].

0.2 Read in data

We will read in the unfiltered SNPs for all samples output as a vcf file after using BWA [@li2009] to map our raw RADseq data to the Thryothorus ludovicianus (Carolina wren) reference genome (available at this NCBI link; [@feng2020]).

#read in vcf
vcfR <- read.vcfR("~/Desktop/marsh.wren.phylogeography/populations.snps.vcf")
#how many samples and SNPs did we get from mapping all samples to the CAWR reference genome?
vcfR
***** Object of Class vcfR *****
201 samples
5726 CHROMs
331,667 variants
Object size: 1239.9 Mb
75.72 percent missing data
*****        *****         *****
#read in sample info csv
sample.info<-read.csv("~/Desktop/marsh.wren.phylogeography/MAWR.phylogeography.full.sampling.csv")

#reorder sampling file to match order of samples in vcf
sample.info<-sample.info[match(colnames(vcfR@gt)[-1], sample.info$Sample),]
#make sure sampling file matches the order of samples in your vcf
sample.info$Sample == colnames(vcfR@gt)[-1]
  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[196] TRUE TRUE TRUE TRUE TRUE TRUE

1 Implement quality filters that don’t involve missing data

This is because removing low data samples will alter percentage/quantile based missing data cutoffs, so we wait to implement those until after deciding on our final set of samples for downstream analysis

#hard filter to minimum depth of 5, and minimum genotype quality of 30
vcfR<-hard_filter(vcfR=vcfR, depth = 3, gq = 30)
21.28% of genotypes fall below a read depth of 3 and were converted to NA
6.22% of genotypes fall below a genotype quality of 30 and were converted to NA

Use the function allele_balance() to filter for allele balance from Puritz SNP filtering tutorial “Allele balance: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous, we expect that the allele balance in our data (for real loci) should be close to 0.5”

#execute allele balance filter
vcfR<-filter_allele_balance(vcfR, min.ratio = .10, max.ratio = .90)
0.18% of het genotypes (0.01% of all genotypes) fall outside of 0.1 - 0.9 allele balance ratio and were converted to NA

We now want to implement a max depth filter (super high depth loci are likely multiple loci stuck together into a single paralogous locus, which we want to remove).

#visualize and pick appropriate max depth cutoff
max_depth(vcfR)
cutoff is not specified, exploratory visualization will be generated.

dashed line indicates a mean depth across all SNPs of 29.4

#filter vcf by the max depth cutoff you chose
vcfR<-max_depth(vcfR, maxdepth = 250)
maxdepth cutoff is specified, filtered vcfR object will be returned
2.65% of SNPs were above a mean depth of 250 and were removed from the vcf

Remove SNPs from the vcfR that have become invariant following the removal of questionable genotypes above, and see how many SNPs we have left after this initial set of filters

vcfR<-min_mac(vcfR, min.mac = 1)
27.48% of SNPs fell below a minor allele count of 1 and were removed from the VCF

vcfR
***** Object of Class vcfR *****
201 samples
5330 CHROMs
234,156 variants
Object size: 882.3 Mb
80.5 percent missing data
*****        *****         *****

2 Setting the missing data by sample threshold

#run function to visualize per sample missingness
miss<-missing_by_sample(vcfR)
No popmap provided

#run function to visualize per SNP missingness
by.snp<-missing_by_snp(vcfR)
cutoff is not specified, exploratory visualizations will be generated
Picking joint bandwidth of 0.0333
Warning: Removed 201 rows containing non-finite values
(`stat_density_ridges()`).

Warning: Removed 1 rows containing missing values (`geom_point()`).

Based on these visualizations, we will want to drop the worst sequenced samples that are dragging down the rest of the dataset. Drop those samples based on an approximate missing data proportion cutoff here (this can always be revised if we end up feeling like this is too lenient or stringent later):

#run function to drop samples above the threshold we want from the vcf
vcfR.trim<-missing_by_sample(vcfR=vcfR, cutoff = .9)
16 samples are above a 0.9 missing data cutoff, and were removed from VCF

#remove invariant sites generated by sample trimming
vcfR.trim<-min_mac(vcfR.trim, min.mac = 1)
0.6% of SNPs fell below a minor allele count of 1 and were removed from the VCF

3 Setting the missing data by SNP threshold

Now we will visualize different per SNP missing data thresholds and identify a value that optimizes the trade-off between amount of missing data and the total number of SNPs retained.

#see what effect trimming samples had on missing data across the dataset
by.snp<-missing_by_snp(vcfR.trim)
cutoff is not specified, exploratory visualizations will be generated
Picking joint bandwidth of 0.051

vcf.85<-missing_by_snp(vcfR.trim, cutoff=.85)
cutoff is specified, filtered vcfR object will be returned
90.77% of SNPs fell below a completeness cutoff of 0.85 and were removed from the VCF

miss<-missing_by_sample(vcf.85)
No popmap provided

#no samples would have > 50% missing data at an 85% completeness cutoff, so I am happy with that

vcf.90<-missing_by_snp(vcfR.trim, cutoff=.9)
cutoff is specified, filtered vcfR object will be returned
92.45% of SNPs fell below a completeness cutoff of 0.9 and were removed from the VCF

miss<-missing_by_sample(vcf.90)
No popmap provided

#no samples would have > 40% missing data at an 90% completeness cutoff

Investigate the effect of missing data

#I will now investigate and make sure sample clustering isnt being obviously driven by missing data
#make a splitstree for the 90% completeness dataset
#convert to genlight
gen<-vcfR2genlight(vcf.90)
#assign populations (a StaMPP requirement)
gen@pop<-as.factor(gen@ind.names)
gen@ind.names<-gsub("2511-03326_rep2","R2_03326",gen@ind.names)
#generate pairwise divergence matrix
sample.div <- stamppNeisD(gen, pop = FALSE)
#export for splitstree
stamppPhylip(distance.mat=sample.div, file="~/Desktop/marsh.wren.phylogeography/90.splits.txt")

#filter to 100% completeness and make a splitstree
vcf.100<-missing_by_snp(vcfR.trim, cutoff = 1)
cutoff is specified, filtered vcfR object will be returned
99.37% of SNPs fell below a completeness cutoff of 1 and were removed from the VCF

#convert to genlight
gen<-vcfR2genlight(vcf.100)
#assign populations (a StaMPP requirement)
gen@pop<-as.factor(gen@ind.names)
gen@ind.names<-gsub("2511-03326_rep2","R2_03326",gen@ind.names)
#generate pairwise divergence matrix
sample.div <- stamppNeisD(gen, pop = FALSE)
#export for splitstree
stamppPhylip(distance.mat=sample.div, file="~/Desktop/marsh.wren.phylogeography/100.splits.txt")

90% complete splitstree (~18K SNPs)

100% complete splitstree (~1,500 SNPs)

We can see that the percentage of missing data is not causing the intermediate clustering, specifically for the one obvious hybrid sample. Additionally, the samples from North Dakota and Saskatchewan are consistently leaking toward the middle, even when there is no missing data in the input SNP matrix, indicating that these samples have real mismatched alleles, not just excess missing data and insufficient signal to assign them.

4 Remove technical replicates

We can also see that for the one sample for which both replicates passed filtering requirements (2511-03326), the two technical replicates are nearly identical in our 90% complete splitstree

In fact, we can see exactly how many genotypes were called the same between the two samples, as an estimate of error rate

#calculate number of conflicting genotype calls between technical replicates
table(gsub(":.*","",(vcf.90@gt[,colnames(vcf.90@gt) == "2511-03326_rep1"])) == gsub(":.*","",(vcf.90@gt[,colnames(vcf.90@gt) == "2511-03326_rep2"])))

FALSE  TRUE 
    3 17438 
#isolate the exact genotypes that conflict
z<-gsub(":.*","",(vcf.90@gt[,colnames(vcf.90@gt) == "2511-03326_rep1" | colnames(vcf.90@gt) == "2511-03326_rep2"]))[gsub(":.*","",(vcf.90@gt[,colnames(vcf.90@gt) == "2511-03326_rep1"])) != gsub(":.*","",(vcf.90@gt[,colnames(vcf.90@gt) == "2511-03326_rep2"])),]
#print them with missing values removed
z[complete.cases(z), ]
     2511-03326_rep1 2511-03326_rep2
[1,] "0/1"           "0/0"          
[2,] "0/0"           "0/1"          
[3,] "0/1"           "1/1"          

Only 3 genotypes differ between these two replicates across > 17K called SNPs. This suggests very low (~0.017%) variability in genotype calls in our final dataset due to non-biological reasons. Looking at the genotypes, the variation is exclusively in genotypes where the caller disagreed about whether to call a homozygous versus het genotype, the types of calls where confidence can vary depending on read depth and random sampling error of the two alleles. This is overall a very encouraging sign for the quality of this dataset.

We will now remove the technical replicate with the most missing data

vcf.90<-vcf.90[,colnames(vcf.90@gt) != "2511-03326_rep2"]
#remove any potentially invariant SNPs
vcf.90<-min_mac(vcf.90, min.mac = 1)
0.02% of SNPs fell below a minor allele count of 1 and were removed from the VCF

5 Remove overlapping SNPs

It is a known thing (see this) that Stacks will not merge SNPs called twice if they are sequenced by separate (but physically overlapping) loci. To account for this, we will simply remove a SNP every time its chromosome and position have already been seen in the dataset with the following code:

#generate dataframe containing information for chromosome and bp locality of each SNP
df<-as.data.frame(vcf.90@fix[,1:2])
#calc number of duplicated SNPs to remove
nrow(df) - length(unique(paste(df$CHROM,df$POS)))
[1] 7
#remove duplicated SNPs
vcf.90<-vcf.90[!duplicated(paste(df$CHROM,df$POS)),]

6 Visualize depth and quality across all retained genotypes

#plot depth per snp and per sample
dp <- extract.gt(vcf.90, element = "DP", as.numeric=TRUE)
heatmap.bp(dp, rlabels = FALSE)

#plot genotype quality per snp and per sample
gq <- extract.gt(vcf.90, element = "GQ", as.numeric=TRUE)
heatmap.bp(gq, rlabels = FALSE)

7 linkage filter

Now filter for linkage (thin dataset to maximum 1 SNP per Kb)

#perform linkage filtering to get a reduced vcf with only unlinked SNPs
vcfR.thin<-distance_thin(vcf.90, min.distance = 1000)

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |                                                                      |   1%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |==                                                                    |   4%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=======                                                               |  11%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |=========                                                             |  14%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==============                                                        |  21%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |================                                                      |  24%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=====================                                                 |  31%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |=======================                                               |  34%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |============================                                          |  41%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |==============================                                        |  44%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |===================================                                   |  51%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |=====================================                                 |  54%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==========================================                            |  61%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |============================================                          |  64%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |===================================================                   |  74%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |==========================================================            |  84%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |=================================================================     |  94%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |====================================================================  |  98%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================|  99%
  |                                                                            
  |======================================================================| 100%
2910 out of 17554 input SNPs were not located within 1000 base-pairs of another SNP and were retained despite filtering

8 write out vcf for downstream analyses

#get info for all SNPs passing filtering vcf dataset
vcf.90
***** Object of Class vcfR *****
184 samples
1441 CHROMs
17,554 variants
Object size: 260.1 Mb
4.205 percent missing data
*****        *****         *****
#write to disk
vcfR::write.vcf(vcf.90, file = "~/Desktop/marsh.wren.phylogeography/filtered.snps.vcf.gz")

#get info for the thinned SNP dataset
vcfR.thin
***** Object of Class vcfR *****
184 samples
1441 CHROMs
2,910 variants
Object size: 42.6 Mb
4.531 percent missing data
*****        *****         *****
#write to disk
vcfR::write.vcf(vcfR.thin, file = "~/Desktop/marsh.wren.phylogeography/filtered.unlinked.snps.vcf.gz")